Abstract

We provide a formal, simple and intuitive theory of rational decision making
including sequential decisions that affect the environment. The theory has
a geometric flavor, which makes the arguments easy to visualize and understand.
Our theory is for complete decision makers, which means that they
have a complete set of preferences. Our main result shows that a complete
rational decision maker implicitly has a probabilistic model of the environment.
We have a countable version of this result that brings light on the issue
of countable vs finite additivity by showing how it depends on the geometry
of the space which we have preferences over. This is achieved through
fruitfully connecting rationality with the Hahn-Banach Theorem. The theory
presented here can be viewed as a formalization and extension of the betting
odds approach to probability of Ramsey and De Finetti [Ram31, deF37].

1 Introduction

We study complete decision makers that can take a sequence of actions to rationally 
pursue any given task. We suppose that the task is described in a reinforcement 
learning framework where the agent takes actions and receives observations and 
rewards. The aim is to maximize total reward in some given sense. 

Rationality is meant in the sense of internal consistency [Sug91], which is how
it has been used in [NM44] and [Sav54]. In [NM44], it is proven that preferences
together with rationality axioms and probabilities for possible events imply the
existence of utility values for those events that explain the preferences as arising
through maximizing expected utility. Their rationality axioms are

1. Completeness: Given any two choices we either prefer one of them to the other 
or we consider them to be equally preferable; 

2. Transitivity: A preferable to B and B to C imply A preferable to C; 

3. Independence: If A is preferable to B and $t \in [0, 1]$ then $tA + (1 - t)C$ is
preferable (or equal) to $tB + (1 - t)C$;

4. Continuity: If A is preferable to B and B to C then there exists $t \in [0, 1]$ such
that B is equally preferable to $tA + (1 - t)C$.

In [Sav54] the probabilities are not given but it is instead proven that preferences
together with rationality axioms imply the existence of probabilities and utilities. 
We are here interested in the case where one is given utility (rewards) and preferences 
over actions and then deriving the existence of a probabilistic world model. We put 
an emphasis on extensions to sequential decision making with respect to a countable 
class of environments. We set up simple axioms for a rational decision maker, which 
implies that the decisions can be explained (or defined) from probabilistic beliefs. 

The theory of [Sav54] is called subjective expected utility theory (SEUT) and was
intended to provide statistics with a strictly behavioral foundation. The behavioral
approach stands in stark contrast to approaches that directly postulate axioms that
"degrees of belief" should satisfy [Cox46, Hal99, Jay03]. Cox's approach [Cox46,
Jay03] has also been found [Par94] to need additional technical assumptions in
addition to the common sense axioms originally listed by Cox. The original proof by
[Cox46] has been exposed as not mathematically rigorous and his theorem as wrong
[Hal99]. An alternative approach by [Ram31, deF37] is interpreting probabilities as
fair betting odds.

The theory of [Sav54] has greatly influenced economics [Sug91] where it has been
used as a description of rational agents. Seemingly strange behavior was explained
as having beliefs (probabilities) and tastes (utilities) that were different from those
of the person to whom it looked irrational. This has turned out to be insufficient as
a description of human behavior [All53, Ell61] and it is better suited as a normative
theory or design principle in artificial intelligence. In this article, we are interested
in studying the necessity for rational agents (biological or not) to have a probabilistic
model of their environment. To achieve this, and to have as simple common
sense axioms of rationality as possible, we postulate that given any set of values (a
contract) associated with the possible events, the decision maker needs to have an
opinion on whether he prefers these values to a guaranteed zero outcome or not (or
equal). From this setting and our other rationality axioms we deduce the existence
of probabilities that explain all preferences as maximizing expected value. There
is an intuitive similarity to the idea of explaining/deriving probabilities as a bookmaker's
betting odds as done in [deF37] and [Ram31]. One can argue that the theory
presented here (in Section 2) is a formalization and extension of the betting odds
approach. Geometrically, the result says that there is a hyper-plane in the space of
contracts that separates accept from reject. We generalize this statement, by using
the Hahn-Banach Theorem, to the countable case where the set of hyper-planes (the
dual space) depends on the space of contracts. The answers for different cases can
then be found in the Banach space theory literature. This provides a new approach
to understanding issues like finite vs. countable additivity. We take advantage of
this to formulate rational agents that can deal successfully with countable (possibly
universal as in all computable environments) classes of environments.

Our presentation begins in Section 2 by first looking at a fundamental case where
one has to accept or reject certain contracts defining positive and negative rewards 
that depend on the outcome of an event with finitely many possibilities. To draw 
the conclusion that there are implicit unique probabilistic beliefs, it is important 
that the decision maker has an opinion (acceptable, rejectable or both) on every 
possible contract. This is what we mean when we say complete decision maker. 

In a more general setting, we consider sequential decision making where given 
any contract on the sequence of observations and actions, the decision maker must 
be able to choose a policy (i.e. an action tree). Note that the actions may affect the 
environment. A contract on such a sequence can e.g. be viewed as describing a re- 
ward structure for a task. An example of a task is a cleaning robot that gets positive 
rewards for collecting dust and negative for falling down the stairs. A prerequisite
for being able to continue to collect dust can be to recharge the battery before running
out. A specialized decision maker that deals only with one contract/task does
not always need to have implicit probabilities; qualitative beliefs can suffice
to take reasonable decisions. A qualitative belief can be that one pizza delivery company
(e.g. Pizza Hut vs Dominos) is more likely to arrive on time than the other.
If one believes the pizzas are equally good and the price is the same, one will choose
the company one believes more often delivers on time. Considering all contracts
(reward structures) on the actions and events, leads to a situation where having a 
way of making rational (coherent) decisions, implies that the decision maker has 
implicit probabilistic beliefs. We say that the probabilities are implicit because the 
decision maker, which might e.g. be a human, a dog, a computer or just a set of 
rules, might have a non-probabilistic description of how the decisions are made. 

In Section 3 we investigate extensions to the case with countably many possible
outcomes and the interesting issue of countable versus finite additivity. Savage's
axioms are known to only lead to finite additivity while [Arr70] showed that adding
a monotone continuity assumption guarantees countable additivity. We find that in
our setting, it depends on the space of contracts in an interesting way. In Section
4 we discuss a setting where we have a class of environments.

2 Rational Decisions for Accepting or Rejecting 
Contracts 

We consider a setting where we observe a symbol (letter) from a finite alphabet and 
we are offered a form of bet we call a contract that we can accept or not. 

Definition 1 (Passive Environment, Event) A passive environment is a sequence
of symbols (letters) $j_t$, called events, being presented one at a time. At
time $t$ the symbols $j_1, \dots, j_t$ are available. We can equivalently say that a passive
environment is a function $\nu$ from finite strings to $\{0, 1\}$ where $\nu(j_1, \dots, j_t) = 1$ if and
only if the environment begins with $j_1, \dots, j_t$.

Definition 2 (Contract) Suppose that we have a passive environment with symbols
from an alphabet with $m$ elements. A contract for an event is an element
$x = (x_1, \dots, x_m)$ in $\mathbb{R}^m$ and $x_j$ is the reward received if the event is the $j$:th symbol,
under the assumption that the contract is accepted (see next definition).

Definition 3 (Decision Maker, Decision) A decision maker (for some unknown
environment) is a set $Z \subset \mathbb{R}^m$ which defines exactly the contracts that are acceptable.
In other words, a decision maker is a function from $\mathbb{R}^m$ to {accepted, rejected,
either}. The function value is called the decision.

If $x \in Z$ and $\lambda \ge 0$ then we want $\lambda x \in Z$ since it is simply a multiple of the same
contract. We also want the sum of two acceptable contracts to be acceptable. If we
cannot lose money we are prepared to accept the contract. If we are guaranteed to
win money we are not prepared to reject it. We summarize these properties in the
definition below of a rational decision maker.

Definition 4 (Rationality I) We say that the decision maker ($Z \subset \mathbb{R}^m$) is rational
if

1. Every contract $x \in \mathbb{R}^m$ is either acceptable or rejectable or both;

2. $x$ is acceptable if and only if $-x$ is rejectable;

3. If $x, y \in Z$ and $\lambda, \gamma \ge 0$ then $\lambda x + \gamma y \in Z$;

4. If $x_k \ge 0\ \forall k$ then $x = (x_1, \dots, x_m) \in Z$ while if $x_k < 0\ \forall k$ then $x \notin Z$.



If we want to compare these axioms to rationality axioms for a preference relation
on contracts we will say that $x$ is better or equal (as in equally good) than $y$ if $x - y$
is acceptable while it is worse or equal if $x - y$ is rejectable. The first axiom is
completeness. The second says that if $x$ is better or equal than $y$ then $y$ is worse
or equal to $x$. The third implies transitivity since $(x - y) + (y - z) = (x - z)$. The
fourth says that if $x$ has a better (or equal) reward than $y$ for any event, then $x$ is
better (or equal) than $y$.

2.1 Probabilities and Expectations 

Theorem 5 (Existence of Probabilities) Given a rational decision maker,
there are numbers $p_i \ge 0$ that satisfy

$$\{x \mid \sum_i x_i p_i > 0\} \subset Z \subset \{x \mid \sum_i x_i p_i \ge 0\}. \qquad (1)$$

Assuming $\sum_i p_i = 1$ makes the numbers unique and we will use the notation $\Pr(i) = p_i$.

Proof. See the proof of the more general Theorem 23. It tells us that the closure $\bar{Z}$
of $Z$ is a closed half space and can be written as $\{x \mid \sum_i x_i p_i \ge 0\}$ for some vector
$p = (p_i)$ (since every linear functional on $\mathbb{R}^m$ is of the form $f(x) = \sum_i x_i p_i$) and not
every $p_i$ is 0. The fourth property tells us that $p_i \ge 0\ \forall i$.

Definition 6 (Expectation) We will refer to the function $g(x) = \sum_i p_i x_i$ from (1)
as the decision maker's expectation. In this terminology, a rational decision maker
has an expectation function and accepts a contract $x$ if $g(x) > 0$ and rejects it if
$g(x) < 0$.

Remark 7 Suppose that we have a contract $x = (x_i)$ where $x_i = 1$ for all $i$. If we
want $g(x) = 1$, we need $\sum_i p_i = 1$.

We will write $E(x)$ instead of $g(x)$ (assuming $\sum_i p_i = 1$) from now on and call it the
expected value or expectation of $x$.
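
Definition 6 can be illustrated with a small sketch. It assumes that the implicit probabilities $p_i$ are already known, whereas in the theory they are only implied by the decisions. The function names `expectation` and `decide` and all numbers are hypothetical choices made for illustration. Contracts with $g(x) = 0$ are reported as undetermined, since (1) does not fix the decision on the boundary.

```python
from typing import Sequence


def expectation(p: Sequence[float], x: Sequence[float]) -> float:
    """Return g(x) = sum_i p_i * x_i for a contract x under beliefs p."""
    if len(p) != len(x):
        raise ValueError("p and x must have the same length")
    return sum(pi * xi for pi, xi in zip(p, x))


def decide(p: Sequence[float], x: Sequence[float]) -> str:
    """Decision implied by (1): accept if g(x) > 0, reject if g(x) < 0."""
    g = expectation(p, x)
    if g > 0:
        return "accepted"
    if g < 0:
        return "rejected"
    # On the hyperplane g(x) = 0 the theorem leaves the decision open.
    return "undetermined"


# Hypothetical beliefs over a three-letter alphabet (assumed, not from the text).
p = [0.5, 0.25, 0.25]
print(decide(p, [2.0, -1.0, -1.0]))   # g(x) = 0.5
print(decide(p, [-1.0, 1.0, 0.0]))    # g(x) = -0.25
print(decide(p, [1.0, -1.0, -1.0]))   # g(x) = 0
```

The weights were chosen to be exactly representable in floating point so that the boundary case is hit exactly; with general weights a small tolerance around zero would be a reasonable addition.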

2.2 Multiple Events 

Suppose that the contract is such that we can view the symbol to be drawn as
consisting of two (or several) symbols from smaller alphabets. That is we can write
a drawn symbol as $(i, j)$ where all the possibilities can be found through $1 \le i \le m$,
$1 \le j \le n$. In this way of writing, a contract is defined by real numbers $x_{i,j}$.
Theorem 5 tells us that for a rational decision maker there exist unique $r_{i,j} \ge 0$
such that $\sum_{i,j} r_{i,j} = 1$ and an expectation function $g(x) = \sum_{i,j} r_{i,j} x_{i,j}$ such that
contracts are accepted if $g(x) > 0$ and rejected if $g(x) < 0$.



2.3 Marginals 

Suppose that we can take rational decisions on bets for a pair of horse races, while 
the person that offers us bets only cares about the first race. Then we are still 
equipped to respond since the bets that only depend on the first race are a subset of
all bets on the pair of races.

Definition 8 (Marginals) Suppose that we have a rational decision maker ($Z$) for
contracts on the events $(i, j)$. Then we say that the marginal decision maker for the
first symbol ($Z_1$) is the restriction of the decision maker $Z$ to the contracts $x_{i,j}$ that
only depend on $i$, i.e. $x_{i,j} = x_i$. In other words given a contract $y = (y_i)$ on the first
event, we extend that contract to a contract on $(i, j)$ by letting $y_{i,j} = y_i$ and then the
original decision maker can decide.

Suppose that $x_{i,j} = x_i$. Then the expectation $\sum_{i,j} r_{i,j} x_{i,j}$ can be rewritten as
$\sum_i p_i x_i$ where $p_i = \sum_j r_{i,j}$. We write that

$$\Pr(i) = \sum_j \Pr(i, j).$$

These are the marginal probabilities for the first variable that describe the marginal
decision maker for that variable. Naturally we can also define a marginal for the
second variable (considering contracts $x_{i,j} = x_j$) by letting $q_j = \sum_i r_{i,j}$ and $\Pr(j) =
\sum_i \Pr(i, j)$. The marginals define sets $Z_1 \subset \mathbb{R}^m$ and $Z_2 \subset \mathbb{R}^n$ of acceptable contracts
on the first and second variables separately.
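
As a minimal sketch of this marginalization, the snippet below computes both marginals from a joint table and checks that a contract depending only on the first race has the same expectation under the joint beliefs and under the marginal. The table `r`, the contract `y` and the function name `marginals` are assumptions made purely for illustration.

```python
def marginals(r):
    """Return (p, q) with p[i] = sum_j r[i][j] and q[j] = sum_i r[i][j]."""
    p = [sum(row) for row in r]
    q = [sum(col) for col in zip(*r)]
    return p, q


# Hypothetical joint beliefs r[i][j] for two races with two outcomes each.
r = [[0.125, 0.375],
     [0.25, 0.25]]
p, q = marginals(r)

# A contract y that depends only on the first race, extended to pairs (i, j)
# by setting x[i][j] = y[i].
y = [4.0, -2.0]
joint_value = sum(r[i][j] * y[i] for i in range(2) for j in range(2))
marginal_value = sum(p[i] * y[i] for i in range(2))

print(p, q)                         # marginals for the first and second race
print(joint_value, marginal_value)  # the two expectations coincide
```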

2.4 Conditioning 

Again suppose that we are taking decisions on bets for a pair of horse races, but this 
time suppose that the first race is already over and we know the result. We are still 
equipped to respond to bets on the second race by extending the bet to a bet on 
both where there is no reward for (pairs of) events that are inconsistent with what 
we know. 

Definition 9 (Conditioning) Suppose that we have a rational decision maker ($Z$)
for contracts on the events $(i, j)$. We define the conditional decision maker $Z_{j=j_0}$
for $i$ given $j = j_0$ by restricting the original decision maker $Z$ to contracts $x_{i,j}$ which
are such that $x_{i,j} = 0$ if $j \ne j_0$. In other words if we start with a contract $y = (y_i)$
on $i$ we extend it to a contract on $(i, j)$ by letting $y_{i,j_0} = y_i$ and $y_{i,j} = 0$ if $j \ne j_0$.
Then the original decision maker can make a decision for that contract.

Suppose that $x_{i,j} = 0$ if $j \ne j_0$. The unconditional expectation of this contract
is $\sum_{i,j} r_{i,j} x_{i,j}$ as usual which equals $\sum_i r_{i,j_0} x_{i,j_0}$. This leads to the same decisions
(i.e. the same $Z$) as using $\sum_i \frac{r_{i,j_0}}{\sum_k r_{k,j_0}} x_{i,j_0}$ which is of the form in Theorem 5. We
write that

$$\Pr(i \mid j_0) = \frac{\Pr(i, j_0)}{\sum_k \Pr(k, j_0)} = \frac{\Pr(i, j_0)}{\Pr(j_0)}. \qquad (2)$$

From this it follows that

$$\Pr(i_0) \Pr(j_0 \mid i_0) = \Pr(j_0) \Pr(i_0 \mid j_0) \qquad (3)$$

which is one way of writing Bayes rule.
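
The conditioning step and the resulting form of Bayes rule can be checked numerically. The following sketch assumes a hypothetical joint table `r` and uses illustrative helper names; it normalizes a column (or row) of the table as in (2) and compares both sides of (3).

```python
import math


def conditional_first_given_second(r, j0):
    """Pr(i | j0) = r[i][j0] / sum_k r[k][j0], as in equation (2)."""
    column = [row[j0] for row in r]
    total = sum(column)
    if total <= 0:
        raise ValueError("cannot condition on an event with zero probability")
    return [value / total for value in column]


def conditional_second_given_first(r, i0):
    """Pr(j | i0) = r[i0][j] / sum_k r[i0][k]."""
    total = sum(r[i0])
    if total <= 0:
        raise ValueError("cannot condition on an event with zero probability")
    return [value / total for value in r[i0]]


# Hypothetical joint beliefs r[i][j] (assumed for illustration).
r = [[0.125, 0.375],
     [0.25, 0.25]]
i0, j0 = 0, 1
pr_i0 = sum(r[i0])                    # marginal Pr(i0)
pr_j0 = sum(row[j0] for row in r)     # marginal Pr(j0)

lhs = pr_i0 * conditional_second_given_first(r, i0)[j0]
rhs = pr_j0 * conditional_first_given_second(r, j0)[i0]
print(math.isclose(lhs, rhs))         # both sides equal Pr(i0, j0)
```

Note that the sketch refuses to condition on an event of probability zero, a case in which the normalization in (2) is undefined.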

2.5 Learning 

In the previous section we defined conditioning which leads us to a definition of what
it means to learn. Given that we have probabilities for events that are sequences of 
a certain number of symbols and we have observed one or several of them, we use 
conditioning to determine what our belief regarding the remaining symbols should 
be. 

Definition 10 (Learning) Given a rational decision maker, defined by $p_{i_1, \dots, i_T}$ for
the events $(i_t)_{t=1}^T$ and the first $t-1$ symbols $i_1, \dots, i_{t-1}$, we define the informed rational
decision maker for $i_t$ by conditioning on the past $i_1, \dots, i_{t-1}$ and marginalizing over the
future $i_{t+1}, \dots, i_T$. Formally,

$$p^{\mathrm{informed}}(i) = \Pr(i \mid i_1, \dots, i_{t-1}) = \frac{\sum_{j_{t+1}, \dots, j_T} p_{i_1, \dots, i_{t-1}, i, j_{t+1}, \dots, j_T}}{\sum_{j_t, \dots, j_T} p_{i_1, \dots, i_{t-1}, j_t, \dots, j_T}}.$$
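
A minimal sketch of Definition 10 follows, assuming the joint beliefs are stored as a dictionary from length-$T$ tuples to probabilities. The example beliefs (an equal mixture of two coins) and all names are hypothetical; they are chosen only to show how conditioning on the past changes the belief about the next symbol.

```python
import math
from itertools import product


def informed_probabilities(joint, alphabet, past, horizon):
    """Pr(i | past): condition on `past`, marginalize over the future."""
    past = tuple(past)

    def mass(prefix):
        # Total probability of all length-`horizon` sequences starting with prefix.
        remaining = horizon - len(prefix)
        return sum(joint.get(prefix + tail, 0.0)
                   for tail in product(alphabet, repeat=remaining))

    denominator = mass(past)
    if denominator <= 0:
        raise ValueError("the observed past has zero probability")
    return {i: mass(past + (i,)) / denominator for i in alphabet}


def iid_probability(theta, sequence):
    """Probability of a 0/1 sequence under independent draws with Pr(1) = theta."""
    result = 1.0
    for symbol in sequence:
        result *= theta if symbol == 1 else 1.0 - theta
    return result


# Hypothetical beliefs: an equal mixture of two coins with Pr(1) = 0.25 and 0.75.
T = 3
joint = {seq: 0.5 * iid_probability(0.25, seq) + 0.5 * iid_probability(0.75, seq)
         for seq in product((0, 1), repeat=T)}

before = informed_probabilities(joint, (0, 1), (), T)
after = informed_probabilities(joint, (0, 1), (1, 1), T)
print(before[1], after[1])  # belief in symbol 1 before and after observing (1, 1)
```

Before any observation the two coins cancel and symbol 1 has probability 0.5; after observing two 1s the belief shifts towards the coin that favours 1, giving a probability of about 0.7.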



2.6 Choosing between Contracts 

Definition 11 (Choosing contract) We say that to rationally prefer contract x 
over y is (equivalent) to rationally consider x — y to be acceptable. 

As before we assume that we have a decision maker that takes rational decisions on 
accepting or rejecting contracts x that are based on an event that will be observed. 
Hence there exist implicit probabilities that represent all choices and an expectation 
function. Suppose that an agent has to choose between action $a_1$ that leads to
receiving reward $x_i$ if $i$ is drawn and action $a_2$ that leads to receiving $y_i$ in the case
of seeing $i$. Let $z_i = x_i - y_i$. We can now go back to choosing between accepting and
rejecting a contract by saying that choosing (preferring) $a_1$ over $a_2$ means accepting
the contract $z$. In other words if $E(x) > E(y)$ choose $a_1$ and if $E(x) < E(y)$ choose
$a_2$.

Remark 12 We note that if we postulate that choosing between contract x and the 
zero contract is the same as choosing between accepting or rejecting x, then being 
able to choose between contracts implies the ability to choose between accepting and 
rejecting one contract. We, therefore, can say that the ability to choose between a 
pair of contracts is equivalent to the ability to choose to accept or reject a single 
contract. 



We can also choose between several contracts. Suppose that action $a_k$ gives us
the contract $x^k = (x^k_i)_{i=1}^m$. If $E(x^j) > E(x^k)\ \forall k \ne j$ then we strictly prefer $a_j$ over
all other actions. In other words a contract $x^j - x^k$ would for all $k \ne j$ be accepted and
not rejected by a rational decision maker.

Remark 13 If we have a rational decision maker for accepting or rejecting contracts,
then there are implicitly probabilities $p_i$ for symbol $i$ that characterize the
decisions. A rational choice between actions $a_k$ leading to contracts $x^k$ is taken by
choosing action

$$a^* = \arg\max_{a_k} \sum_i p_i x^k_i. \qquad (4)$$
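
The choice rule (4) can be sketched as follows, again assuming known probabilities. The action labels, contracts and function names are hypothetical and serve only to illustrate the rule.

```python
def expected_value(p, x):
    """Return sum_i p_i * x_i."""
    return sum(pi * xi for pi, xi in zip(p, x))


def best_action(p, contracts):
    """Return the action whose contract maximizes the expectation, as in (4).

    Ties are broken in favour of the first maximizer in insertion order.
    """
    return max(contracts, key=lambda action: expected_value(p, contracts[action]))


# Hypothetical beliefs and contracts (assumed for illustration).
p = [0.5, 0.25, 0.25]
contracts = {
    "a1": [1.0, 0.0, 0.0],    # expectation 0.5
    "a2": [0.0, 2.0, 2.0],    # expectation 1.0
    "a3": [1.0, 1.0, -1.0],   # expectation 0.5
}
chosen = best_action(p, contracts)
print(chosen)

# The difference contract against every other action has positive expectation,
# so it would be accepted and not rejected.
for other, x in contracts.items():
    if other != chosen:
        difference = [a - b for a, b in zip(contracts[chosen], x)]
        print(other, expected_value(p, difference))
```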



2.7 Choosing between Environments 

In this section, we assume that the event that the contracts are concerned with 
might be affected by the choice of action. 

Definition 14 (Reactive environment) An environment is a tree with symbols
$j_t$ (percepts) on the nodes and actions $a_t$ on the edges. We provide the environment
with an action $a_t$ at each time $t$ and it presents the symbol $j_t$ at the node we arrive
at by following the edge chosen by the action. We can also equivalently say that a
reactive environment $\nu$ is a function from strings $a_1 j_1, \dots, a_t j_t$ to $\{0, 1\}$ which equals
1 if and only if $\nu$ would produce $j_1, \dots, j_t$ given the actions $a_1, \dots, a_t$.

We will define the concept of a decision maker for the case where one decision 
will be taken in a situation where not only the contract, but also the outcome can 
depend on the choice. We do this by defining the choice as being between two 
different environments. 

Definition 15 (Active decision maker) Consider a choice between having contract
$x$ for passive environment $env_1$ or contract $y$ for passive environment $env_2$. A
decision maker is a set $Z \subset \mathbb{R}^{m_1} \times \mathbb{R}^{m_2}$ which defines exactly the pairs $(x, y)$ for
which we choose $env_1$ with $x$ over $env_2$ with $y$.

Definition 16 (Rational active choice) To choose between action $a_1$ with contract
$x$ and $a_2$ with contract $y$ in a situation where the action may affect the event,
we consider two separate environments, namely the environments that result from
the two different actions. We would then have a situation where we will have one
observation from each environment. Preferring $a_1$ with $x$ to $a_2$ with $y$ is (equivalent)
to consider $x - y$ to be an acceptable contract for the pair of events.

Remark 17 Definition 16 means that $a_1$ with $x$ is preferred over $a_2$ with $y$ if $a_1$
with $x - y$ is preferred over $a_2$ with the zero contract.




Proposition 18 (Probabilities for reactive setting) Suppose that we have a
reactive environment and a rational active decision maker that will make one choice
between action $a_1$ and $a_2$ as described in Definitions 15 and 16. Then there exist
$p_i \ge 0$ and $q_i \ge 0$ such that action $a_1$ with contract $x$ is preferred over action $a_2$
with contract $y$ if $\sum_i p_i x_i > \sum_i q_i y_i$ and the reverse if $\sum_i p_i x_i < \sum_i q_i y_i$. This means
that the decision maker acts according to probabilities $\Pr(\cdot \mid a_1)$ and $\Pr(\cdot \mid a_2)$.

Proof. Let $Z$ be all contracts that when combined with action $a_1$ are preferred
over $a_2$ with the zero contract. Theorem 5 guarantees the existence of $p_i$ such that
$\sum_i p_i x_i > 0$ implies that $x \in Z$ and $\sum_i p_i x_i < 0$ implies that $x \notin Z$. The same
way we find $q_i$ that describe when we prefer $a_2$ with $y$ to $a_1$ with the zero contract.
That these probabilities ($p_i$ and $q_i$) explain the full decision maker as stated in the
proposition now follows directly from Definition 16 understood as in Remark 17.

Suppose that we are going to make a sequence of $T < \infty$ decisions where at
every point of time we will have a finite number of actions to choose between. We
will consider contracts, which can pay out some reward at each time step and that
can depend on everything (actions chosen and symbols observed) that has happened
up until this time and we want to maximize the accumulated reward at time $T$.

We can view the choice as just making one choice, namely choosing an action 
tree. We will sometimes call an action tree a policy. 

Definition 19 (Action tree) An action tree is a function from histories of symbols
$j_1, \dots, j_{t-1}$ and decisions $a_1, \dots, a_{t-1}$ to new decisions, given that the decisions were
made according to the function. Formally,

$$f(a_1, j_1, \dots, a_{t-1}, j_{t-1}) = a_t.$$

An action tree will assign exactly one action for any of the circumstances that
one can end up in. That is, given the history up to any time $t < T$ of actions and
events, we have a chosen action. We can, therefore, choose an action tree at time
0 and receive a total accumulated reward at time $T$. This brings us back to the
situation of one event and one rational choice.

Definition 20 (Sequential decisions) Given a rational decision maker for the
events $(j_t)_{t=1}^T$ and the first $t - 1$ symbols $j_1, \dots, j_{t-1}$ and decisions $a_1, \dots, a_{t-1}$, we
define the informed rational decision maker at time $t$ by conditioning on the past
$a_1, j_1, \dots, a_{t-1}, j_{t-1}$.

Proposition 21 (Beliefs for sequential decisions) Suppose that we have a reactive
environment and a rational decision maker that will take $T < \infty$ decisions.
Furthermore, suppose that the decisions before time $t$, $0 < t < T$, have been taken and resulted in
history $a_1, j_1, \dots, a_{t-1}, j_{t-1}$. Then the decision maker's preferences at this time can be
explained (through expected utility maximization) by probabilities

$$\Pr(j_t, \dots, j_T \mid a_1, j_1, \dots, a_{t-1}, j_{t-1}, a_t, a_{t+1}, \dots, a_T).$$



Proof. Definition 20 and Proposition 18 immediately lead us to the conclusion
that given a past up to a point t — 1 and a policy for the time t to T we have 
probabilistic beliefs over the possible future sequences from time t to T and the 
choice is categorized by maximizing expected accumulated reward at time T. 

3 Countable Sets of Events 

Instead of a finite set of possible outcomes, we will in this section assume a countable
set. We suppose that the set of contracts is a vector space of sequences $x_k$, $k =
0, 1, 2, \dots$ where we use pointwise addition and multiplication with scalar. We will
define a space by choosing a norm and let the space consist of the sequences that
have finite norm as is common in Banach space theory. If the norm makes the
space complete it is called a Banach sequence space [Die84]. Interesting examples
are $\ell^\infty$ of bounded sequences with the maximum (supremum) norm $\|(\alpha_k)\|_\infty = \sup_k |\alpha_k|$, $c_0$ of
sequences that converge to 0 equipped with the same norm and $\ell^p$ which
for $1 \le p < \infty$ is defined by the norm

$$\|(\alpha_k)\|_p = \Big(\sum_k |\alpha_k|^p\Big)^{1/p}.$$

For all of these spaces we can consider weighted versions ($w_k > 0$) where

$$\|(\alpha_k)\|_{p, w_k} = \|(\alpha_k w_k)\|_p.$$

This means that $\alpha \in \ell^p(w)$ iff $(\alpha_k w_k) \in \ell^p$, e.g. $\alpha \in \ell^\infty(w)$ iff $\sup_k |\alpha_k w_k| < \infty$.
Given a Banach (sequence) space $X$ we use $X'$ to denote the dual space that consists
of all continuous linear functionals $f : X \to \mathbb{R}$. It is well known that a linear
functional on a Banach space is continuous if and only if it is bounded, i.e. that
there is $C < \infty$ such that $\frac{|f(x)|}{\|x\|} \le C$ for all nonzero $x \in X$. Equipping $X'$ with the norm
$\|f\| = \sup_{x \ne 0} \frac{|f(x)|}{\|x\|}$ makes it into a Banach space. Some examples are $(\ell^1)' = \ell^\infty$,
$c_0' = \ell^1$ and for $1 < p < \infty$ we have that $(\ell^p)' = \ell^q$ where $1/p + 1/q = 1$. These
identifications are all based on formulas of the form

$$f(x) = \sum_i x_i p_i$$

where the dual space is the space that $(p_i)$ must lie in to make the functional
both well defined and bounded. It is clear that $\ell^1 \subset (\ell^\infty)'$ but $(\ell^\infty)'$ also contains
"stranger" objects.

The existence of these other objects can be deduced from the Hahn-Banach
theorem (see e.g. [Kre89] or [NB97]) that says that if we have a linear function
defined on a subspace $Y \subset X$ and if it is bounded on $Y$ then there is an extension
to a bounded linear functional on $X$. If $Y$ is dense in $X$ the extension is unique
but in general it is not. One can use this Theorem by first looking at the subspace
of all sequences in $\ell^\infty$ that converge and let $f(\alpha) = \lim_{k \to \infty} \alpha_k$. The Hahn-Banach
theorem guarantees the existence of extensions to bounded linear functionals that
are defined on all of $\ell^\infty$. These are called Banach limits. The space $(\ell^\infty)'$ can be
identified with the so called ba space of bounded and finitely additive measures with
the variation norm $\|\nu\| = |\nu|(A)$ where $A$ is the underlying set. Note that $\ell^1$ can be
identified with the smaller space of countably additive bounded measures with the
same norm. The Hahn-Banach Theorem has several equivalent forms. One of these
identifies the hyper-planes with the bounded linear functionals [NB97].

Definition 22 (Rationality II) Given a Banach sequence space $X$ of contracts,
we say that the decision maker (subset $Z$ of $X$ defining acceptable contracts) is
rational if

1. Every contract $x \in X$ is either acceptable or rejectable or both;

2. $x$ is acceptable if and only if $-x$ is rejectable;

3. If $x, y \in Z$ and $\lambda, \gamma \ge 0$ then $\lambda x + \gamma y \in Z$;

4. If $x_k \ge 0\ \forall k$ then $x = (x_k)$ is acceptable while if $x_k > 0\ \forall k$ then $x$ is not
rejectable.

Theorem 23 (Linear separation) Suppose that we have a space of contracts $X$
that is a Banach sequence space. Given a rational decision maker there is a positive
continuous linear functional $f : X \to \mathbb{R}$ such that

$$\{x \mid f(x) > 0\} \subset Z \subset \{x \mid f(x) \ge 0\}. \qquad (5)$$

Proof. The third property tells us that $Z$ and $-Z$ are convex cones. The second
and fourth property tells us that $Z \ne X$. Suppose that there is a point $x$ that
lies in both the interior of $Z$ and of $-Z$. Then the same is true for $-x$ according
to the second property and for the origin. That a ball around the origin lies in $Z$
means that $Z = X$ which is not true. Thus the interiors of $Z$ and $-Z$ are disjoint
open convex sets and can, therefore, be separated by a hyperplane (according to
the Hahn-Banach theorem) which goes through the origin (since according to the
second and fourth property the origin is both acceptable and rejectable). The first
two properties tell us that $Z \cup -Z = X$. Given a separating hyperplane (between
the interiors of $Z$ and $-Z$), $Z$ must contain everything on one side. This means
that $Z$ is a half space whose boundary is a hyperplane that goes through the origin
and the closure $\bar{Z}$ of $Z$ is a closed half space and can be written as $\{x \mid f(x) \ge 0\}$
for some $f \in X'$. The fourth property tells us that $f$ is positive.

Corollary 24 (Additivity) 1. If $X = c_0$ then a rational decision maker is described
by a countably additive (probability) measure.

2. If $X = \ell^\infty$ then a rational decision maker is described by a finitely additive
(probability) measure.




It seems from Corollary 24 that we pay the price of losing countable additivity
for expanding the space of contracts from $c_0$ to $\ell^\infty$ but we can expand the space
even more by looking at $c_0(w)$ where $w_k \to 0$ which contains $\ell^\infty$ and $X'$ is then
$\ell^1((1/w_k))$. This means that we get countable additivity back but we instead have
a restriction on how fast the probabilities $p_k$ must tend to 0. Note that a bounded
linear functional on $c_0$ can always be extended to a bounded linear functional on
$\ell^\infty$ by the formula $f(x) = \sum_i p_i x_i$ but that is not the unique extension. Note also
that every bounded linear functional on $\ell^\infty$ can be restricted to $c_0$ and there be
represented as $f(x) = \sum_i p_i x_i$. Therefore, a rational decision maker on $\ell^\infty$ contracts
has probabilistic beliefs (unless $p_i = 0\ \forall i$), though it might also take asymptotic
behavior of a contract into account. An example (and here $p_i = 0\ \forall i$) is the decision
maker that makes decisions based on asymptotic averages $\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n x_i$ when
they exist. That strategy can be extended to all of $\ell^\infty$ (a Banach limit). The
following proposition will help us decide which decision maker on $\ell^\infty$ is described
with countably additive probabilities.

Proposition 25 Suppose that $f \in (\ell^\infty)'$. For any $x \in \ell^\infty$, let $x^j_i = x_i$ if $i \le j$ and
$x^j_i = 0$ otherwise. If for any $x$,

$$\lim_{j \to \infty} f(x^j) = f(x),$$

then $f$ can be written as $f(x) = \sum_i p_i x_i$ where $p_i \ge 0$ and $\sum_{i=1}^\infty p_i < \infty$.

Proof. The restriction of $f$ to $c_0$ gives us numbers $p_i \ge 0$ such that $\sum_{i=1}^\infty p_i < \infty$
and $f(x) = \sum_i p_i x_i$ for $x \in c_0$. This means that $f(x^j) = \sum_{i=1}^j p_i x_i$ for any $x \in \ell^\infty$
and $j < \infty$. Thus $\lim_{j \to \infty} f(x^j) = \sum_{i=1}^\infty p_i x_i$.

Definition 26 (Monotone decisions) We define the concept of a monotone decision
maker in the following way. Suppose that for every $x \in \ell^\infty$ there is $N < \infty$ such
that the decision is the same for all $x^j$, $j \ge N$ (see Proposition 25 for definition) as
for $x$. Then we say that the decision maker is monotone.

Example 27 Let $f \in (\ell^\infty)'$ be such that if $\lim_{k \to \infty} \alpha_k = L$ then $f(\alpha) = L$ (i.e. $f$ is
a Banach limit). Furthermore define a rational decision maker by letting the set
of acceptable contracts be $Z = \{x \mid f(x) \ge 0\}$. Then $f(x^j) = 0$ (where we use
notation from Proposition 25) for all $j < \infty$ and regardless of which $x$ we define
$x^j$ from. Therefore, all sequences that are eventually zero are acceptable contracts.
This means that this decision maker is not monotone since there are contracts that
are not acceptable.
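
Example 27 can be made concrete with a rough numerical sketch. A Banach limit itself is non-constructive, so the code below only assumes that it agrees with the asymptotic average on the sequences used, and approximates that average by a long finite window. The contract $x = (-1, -1, \dots)$, the window length `n` and the function names are hypothetical choices for illustration.

```python
def truncate(x, j):
    """Return x^j with x^j(i) = x(i) for i <= j and 0 otherwise (i = 1, 2, ...)."""
    return lambda i: x(i) if i <= j else 0.0


def average(x, n):
    """Average of the first n terms; a finite stand-in for the asymptotic average."""
    return sum(x(i) for i in range(1, n + 1)) / n


def minus_ones(i):
    """The contract x = (-1, -1, ...), which pays -1 for every outcome."""
    return -1.0


n = 100_000  # window length (an approximation of the limit, chosen arbitrarily)

full = average(minus_ones, n)                      # stays at -1: x is not acceptable
truncated = average(truncate(minus_ones, 100), n)  # tends to 0 as n grows
print(full, truncated)
```

Every truncation $x^j$ has asymptotic average 0 and is therefore acceptable, while $x$ itself has average $-1$ and is not, so no finite $N$ makes the decisions agree.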

Theorem 28 (Monotone rationality) Given a monotone rational decision
maker for $\ell^\infty$ contracts, there are $p_i \ge 0$ such that $\sum_i p_i < \infty$ and

$$\{x \mid \sum_i x_i p_i > 0\} \subset Z \subset \{x \mid \sum_i x_i p_i \ge 0\}. \qquad (6)$$

Proof. According to Theorem 23 there is $f \in (\ell^\infty)'$ such that (the closure of
$Z$) $\bar{Z} = \{x \mid f(x) \ge 0\}$. Let $p_i \ge 0$ be such that $\sum_i p_i < \infty$ and such that
$f(x) = \sum_i x_i p_i$ for $x \in c_0$. Remember that $x^j$ (notation as in Proposition 25) is
always in $c_0$. Suppose that there is $x$ such that $x$ is accepted but $\sum_i x_i p_i < 0$. This
violates monotonicity since there exists $N < \infty$ such that $\sum_{i=1}^n x_i p_i < 0$ for all $n \ge N$
and, therefore, $x^j$ is not accepted for $j \ge N$ but $x$ is accepted. We conclude that if
$x$ is accepted then $\sum_i p_i x_i \ge 0$ and if $\sum_i p_i x_i > 0$ then $x$ is accepted.

4 Rational Agents for Classes of Environments 

We will here study agents that are designed to deal with a large range of situations. 
Given a class of environments we want to define agents that can learn to act well 
when placed in any of them, assuming it is at all possible. 

Definition 29 (Universality for a class) We say that a decision maker is universal
for a class of environments $\mathcal{M}$ if for any outcome sequence $a_1 j_1 a_2 j_2 \dots$ that
given the actions would be produced by some environment in the class, there is $c > 0$
(depending on the sequence) such that the decision maker has probabilities that satisfy

$$\Pr(j_1, \dots, j_t \mid a_1, \dots, a_t) > c \quad \forall t.$$

This is obviously true if the decision maker's probabilistic beliefs are a convex combination
$\sum_{\nu \in \mathcal{M}} w_\nu \nu$, $w_\nu > 0$ and $\sum_\nu w_\nu = 1$.

We will next discuss how to define some large classes of environments and agents 
that can succeed for them. We assume that the total accumulated reward from 
the environment will be finite regardless of our actions since we want any policy to 
have finite utility. Furthermore, we assume that rewards are positive and that it is 
possible to achieve strictly positive rewards in any environment. We would like the 
agent to perform well regardless of which environment from the chosen class it is 
placed in. 

For any possible policy (action tree) $\pi$ and environment $\nu$, there is a total reward $V^\pi_\nu$ that following $\pi$ in $\nu$ would result in. This means that for any $\pi$ there is a contract sequence $(V^\pi_\nu)_\nu$, assuming we have enumerated our set of environments. Let

$$V^*_\nu = \max_\pi V^\pi_\nu.$$

We know that $V^*_\nu > 0$ for all $\nu$. Every contract sequence $(V^\pi_\nu)_\nu$ lies in $X = \ell^\infty((1/V^*_\nu))$ and $\|(V^\pi_\nu)\|_X \leq 1$. The rational decision makers are the positive, continuous linear functionals on $X$. $X'$ contains the space $\ell^1((V^*_\nu))$. In other words, if $w_\nu \geq 0$ and $\sum_\nu w_\nu V^*_\nu < \infty$ then the sequence $(w_\nu)$ defines a rational decision maker for the contract space $X$. These are exactly the monotone rational decision makers. Letting (which is the AIXI agent from [Hut05])

$$\pi^* \in \arg\max_\pi \sum_\nu w_\nu V^\pi_\nu \qquad (7)$$
we have a choice with the property that for any other $\pi$ with

$$\sum_\nu w_\nu V^{\pi^*}_\nu > \sum_\nu w_\nu V^\pi_\nu,$$

the contract $(V^{\pi^*}_\nu - V^\pi_\nu)_\nu$ is not rejectable. In other words, $\pi^*$ is strictly preferable to $\pi$. By letting $p_\nu = w_\nu V^*_\nu$, we can rewrite (7) as

$$\pi^* \in \arg\max_\pi \sum_\nu p_\nu \frac{V^\pi_\nu}{V^*_\nu}. \qquad (8)$$

If one further restricts the class of environments by assuming $V^*_\nu \leq 1$ for all $\nu$, then for every $\pi$, $(V^\pi_\nu) \in \ell^\infty$. Therefore, by Theorem 28 the monotone rational agents for this setting can be formulated as in (7) with $(w_\nu) \in \ell^1_+$, i.e. $\sum_\nu w_\nu < \infty$. However, since $(p_\nu) \in \ell^1$, a formulation of the form of (8) is also possible. Normalizing $p$ and $w$ individually to probabilities makes (7) into a maximum expected utility criterion and (8) into a maximum expected relative utility criterion. As long as $w$ and $p$ are related by $p_\nu = w_\nu V^*_\nu$, the two criteria yield the same decisions. If we instead based both expectations on the same probabilistic beliefs, the criteria would differ. When we have an upper bound $V^*_\nu < b < \infty$ for all $\nu$ we can always translate expected utility to expected relative utility in this way, while we need a lower bound $0 < a < V^*_\nu$ to rewrite an expected relative utility as an expected utility. Note that the different criteria will start to deviate from each other after the probabilistic beliefs are updated.
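The relation between criteria (7) and (8) can be illustrated numerically. This is a minimal sketch under assumed values: the two policies, two environments, the reward table `V` and the weights `w` are hypothetical and chosen only so that the arithmetic is easy to follow. It shows that with $p_\nu = w_\nu V^*_\nu$ both criteria score every policy identically, whereas using the same beliefs `w` in both criteria can select different policies.

```python
from fractions import Fraction as F

# Hypothetical total rewards V[policy][nu] for two policies in two environments.
V = {
    "pi_a": [F(1), F(1, 100)],
    "pi_b": [F(3, 5), F(1, 10)],
}
w = [F(1, 2), F(1, 2)]  # assumed weights w_nu

def optimal_values(V):
    """V*_nu = max over policies of V^pi_nu."""
    n_envs = len(next(iter(V.values())))
    return [max(rewards[nu] for rewards in V.values()) for nu in range(n_envs)]

def expected_utility(V, w):
    """Criterion (7): sum_nu w_nu * V^pi_nu for each policy."""
    return {pi: sum(wn * v for wn, v in zip(w, rewards))
            for pi, rewards in V.items()}

def expected_relative_utility(V, p, v_star):
    """Criterion (8): sum_nu p_nu * V^pi_nu / V*_nu for each policy."""
    return {pi: sum(pn * v / vs for pn, v, vs in zip(p, rewards, v_star))
            for pi, rewards in V.items()}

def best(scores):
    """Policy with the highest score."""
    return max(scores, key=scores.get)

v_star = optimal_values(V)                    # V* per environment
p = [wn * vs for wn, vs in zip(w, v_star)]    # p_nu = w_nu * V*_nu

eu = expected_utility(V, w)
ru = expected_relative_utility(V, p, v_star)
# Using the same beliefs w in the relative criterion gives a different criterion.
ru_same_beliefs = expected_relative_utility(V, w, v_star)
```

Here `eu` and `ru` coincide for every policy, so they pick the same policy, while `ru_same_beliefs` favours the policy that does relatively well in the low-reward environment.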

4.1 Asymptotic Optimality 

Denote a chosen countable class of environments by $\mathcal{M}$. Let $V^\pi_{\nu,k}$ be the rewards achieved after time $k$ using policy $\pi$ in environment $\nu$. We suppress the dependence on the history so far. Let

$$W^\pi_{\nu,k} = \frac{V^\pi_{\nu,k}}{V^*_{\nu,k}}$$

denote the skill (relative reward) of $\pi$ in environment $\nu$ from time $k$. The maximum possible skill is 1. We would like to have a policy $\pi$ such that

$$\lim_{k \to \infty} W^\pi_{\nu,k} = 1 \quad \forall \nu \in \mathcal{M}.$$

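This would mean that the agent asymptotically achieves maximum skill when placed in any environment from $\mathcal{M}$. Let $I(h_k, \nu) = 1$ if $\nu$ is consistent with history $h_k$ and $I(h_k, \nu) = 0$ otherwise. Furthermore, let $p_{\nu,k}$, which is obtained from the initial weight $p_{\nu,0}$ and the indicator $I(h_k, \nu)$ so that $p_{\nu,k} = 0$ whenever $\nu$ is inconsistent with $h_k$, be the agent's weight for environment $\nu$ at time $k$, and let $\pi^p$ be a policy that at time $k$ acts according to a policy in

$$\arg\max_\pi \sum_\nu p_{\nu,k} \frac{V^\pi_{\nu,k}}{V^*_{\nu,k}}. \qquad (9)$$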
In the following theorem, we prove that for every environment $\nu \in \mathcal{M}$, the policy $\pi^p$ will asymptotically achieve perfect relative rewards. We have to assume that there exists a sequence of policies $\pi_k$ with this property (as for the similar Theorem 5.34 in [Hut05], which dealt with discounted values). The convergence in $W$-values is the relevant sense of optimality for our setting, since the $V$-values converge to zero for any policy.

Theorem 30 (Asymptotic optimality) Suppose that we have a decision maker that is universal (i.e. $p_\nu > 0 \ \forall \nu$) with respect to the countable class $\mathcal{M}$ of environments (which can be stochastic) and that there exist policies $\pi_k$ such that for all $\nu$, $W^{\pi_k}_{\nu,k} \to 1$ if $\nu$ is the actual environment (or the sequence is consistent with $\nu$). This implies that $W^{\pi^p}_{\mu,k} \to 1$, where $\mu$ is the actual environment.

The proof technique is similar to that of Theorem 5.34 in [Hut05].
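Proof. Let

$$0 \leq 1 - W^{\pi_k}_{\nu,k} =: \Delta^\nu_k, \qquad \Delta_k = \sum_\nu p_{\nu,k} \Delta^\nu_k. \qquad (10)$$

The assumptions tell us that $\Delta^\nu_k = 1 - W^{\pi_k}_{\nu,k} \to 0$ for all $\nu$ that are consistent with the sequence ($p_{\nu,k} = 0$ if $\nu$ is inconsistent with the history at time $k$), and since $\Delta^\nu_k \leq 1$, it follows that

$$\Delta_k = \sum_\nu p_{\nu,k} \Delta^\nu_k \to 0.$$

Note that, because $\pi^p$ maximizes the weighted skill in (9),

$$p_{\mu,k}\big(1 - W^{\pi^p}_{\mu,k}\big) \leq \sum_\nu p_{\nu,k}\big(1 - W^{\pi^p}_{\nu,k}\big) \leq \sum_\nu p_{\nu,k}\big(1 - W^{\pi_k}_{\nu,k}\big) = \sum_\nu p_{\nu,k} \Delta^\nu_k = \Delta_k.$$

Since we also know that $p_{\mu,k} \geq p_{\mu,0} > 0$, it follows that $\big(1 - W^{\pi^p}_{\mu,k}\big) \to 0$.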
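The inequality chain in the proof can be checked at a single time step. The sketch below assumes hypothetical weights `p` and hypothetical skill vectors for two candidate policies, one of which plays the role of the reference policy $\pi_k$; the function names are illustrative and not from the paper. It selects the policy maximizing the weighted skill, as in (9), and compares the skill gap in the true environment $\mu$ with the bound $\Delta_k / p_{\mu,k}$.

```python
def weighted_gap(p, skills):
    """Delta = sum_nu p_nu * (1 - W_nu) for one policy's skills W_nu in [0, 1]."""
    return sum(pn * (1.0 - wn) for pn, wn in zip(p, skills))

def choose_policy(p, candidates):
    """Pick the candidate maximizing sum_nu p_nu * W_nu, as in criterion (9)."""
    return max(candidates,
               key=lambda name: sum(pn * wn for pn, wn in zip(p, candidates[name])))

def gap_and_bound(p, candidates, reference, mu):
    """Return the chosen policy, its skill gap in environment mu, and Delta / p_mu.

    The reference policy must be one of the candidates so that the chosen
    policy's weighted gap cannot exceed the reference policy's weighted gap.
    """
    chosen = choose_policy(p, candidates)
    delta = weighted_gap(p, candidates[reference])
    gap_mu = 1.0 - candidates[chosen][mu]
    return chosen, gap_mu, delta / p[mu]

# Assumed weights p_{nu,k} and skills W_{nu,k} at one fixed time k.
p = [0.2, 0.5, 0.3]
candidates = {
    "reference": [0.90, 0.95, 0.90],   # skills of the assumed policy pi_k
    "other": [0.80, 1.00, 1.00],
}
chosen, gap_mu, bound = gap_and_bound(p, candidates, "reference", mu=0)
```

In this example the selected policy is not the reference policy, yet its gap in the true environment is still below the bound, and the bound shrinks to zero whenever the reference policy's weighted gap does.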

5 Conclusions 

We studied complete rational decision makers including the cases of actions that may 
affect the environment and sequential decision making. We set up simple common 
sense rationality axioms that imply that a complete rational decision maker has 
preferences that can be characterized as maximizing expected utility. Of particular 
interest is the countable case where our results follow from identifying the Banach 
space dual of the space of contracts. 